All data sets used in the case studies are at: https://github.com/yichenqin/dataviz/
For each question in each case study, your answer should include: R code, R output, and interpretation of the R output. Below is an example.
Use dim() to find out the number of observations.
d = read.csv("college.csv")
dim(d)[1]
## [1] 1269
You can use either Microsoft Word software or Rmarkdown to prepare your report. To use Microsoft Word, just copy and paste R code and output from RStudio to Word.
Please submit your report (in doc/html) to Canvas before the deadline.
Note that 20% of the total grade depends on how closely your answers follow the visualization principles and requirement checklist.
Questions
(5 points) Create a working directory, and download all three
data files from Canvas (CVG_Flights.csv,
airlines.csv, and airports.csv) to the
directory. Read these files into R as three data frames. Here is the
code sample/template.
flights = read.csv("CVG_Flights.csv", header = TRUE, na.strings = "")
Note that na.strings = "" converts blank entries ("") to
NA.
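To see what na.strings = "" does, here is a minimal sketch that reads a small inline CSV instead of the real file. The column names FLIGHT and REASON and all values are made up for illustration.

```r
# Toy CSV text standing in for a real file (hypothetical columns and values).
# The second data row has an empty REASON field.
csv_text <- "FLIGHT,REASON\n101,B\n102,\n103,A"

# Without na.strings = "", the empty field is kept as an empty string.
raw <- read.csv(text = csv_text, header = TRUE)

# With na.strings = "", the empty field is read as NA.
clean <- read.csv(text = csv_text, header = TRUE, na.strings = "")

is.na(raw$REASON)
is.na(clean$REASON)
```

Only the second call marks the blank entry as missing, which is what functions such as is.na() rely on later.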
(10 points) How many rows and columns are there in each data frame? What does each row represent (a plane, an airport, a flight, or an airline company)? What does each column represent? Explain the meanings of the variables to the best of your understanding.
(5 points) Merge all three data frames into one data frame
according to the IATA code of airlines and airports. For
airports.csv, please merge it according to both the origin
and destination airports in CVG_Flights.csv, which means
you need to merge twice. Here is the code sample/template.
merged_data <- left_join(flights_data, airlines_data, by=c("AIRLINE"="IATA_CODE"))
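The template above covers only the airline merge. Here is a minimal sketch of all three merges on toy data, assuming the dplyr package is installed. The columns DESTINATION_AIRPORT, AIRLINE_NAME, and CITY, and all values, are assumptions for illustration; check the real column names with names().

```r
library(dplyr)

# Toy stand-ins for the three files (hypothetical values).
flights_data <- data.frame(
  AIRLINE = c("DL", "AA"),
  ORIGIN_AIRPORT = c("CVG", "ORD"),
  DESTINATION_AIRPORT = c("ATL", "CVG")
)
airlines_data <- data.frame(
  IATA_CODE = c("DL", "AA"),
  AIRLINE_NAME = c("Delta", "American")
)
airports_data <- data.frame(
  IATA_CODE = c("CVG", "ATL", "ORD"),
  CITY = c("Cincinnati", "Atlanta", "Chicago")
)

# Prefix the airport columns so the two airport merges do not collide.
origin_info <- rename_with(airports_data, ~ paste0("ORIGIN_", .x))
dest_info   <- rename_with(airports_data, ~ paste0("DEST_", .x))

merged_data <- flights_data %>%
  left_join(airlines_data, by = c("AIRLINE" = "IATA_CODE")) %>%
  left_join(origin_info, by = c("ORIGIN_AIRPORT" = "ORIGIN_IATA_CODE")) %>%
  left_join(dest_info, by = c("DESTINATION_AIRPORT" = "DEST_IATA_CODE"))

merged_data
```

Renaming before joining is one design choice among several. Without it, left_join() would append suffixes such as .x and .y to the duplicated airport columns.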
(5 points) For this merged data set, print the first six rows.
(5 points) For this merged data set, are there any missing values? In what variables are these missing values? What are the percentages of missing values for each variable (i.e., the number of missing values divided by the total number of observations)?
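One way to compute these percentages, sketched on a toy data frame with made-up values. The same two lines should work on the merged data frame.

```r
# Toy data frame with some missing values (hypothetical values).
d <- data.frame(
  DEPARTURE_DELAY = c(5, NA, 12, NA),
  AIRLINE = c("DL", "AA", NA, "DL")
)

# is.na(d) is a TRUE/FALSE matrix, so each column mean is the
# proportion of missing values in that variable.
missing_pct <- colMeans(is.na(d)) * 100
missing_pct
```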
(10 points) What is the proportion of canceled flights (to all flights)? How many different cancellation reasons are there?
(10 points) For DEPARTURE_TIME, are there missing
values? Do we know why these values are missing? Hint: canceled
flights?
(10 points) In the merged data frame, create a new variable
(i.e., new column) as the time difference between the
SCHEDULED_TIME and the ELAPSED_TIME, i.e.,
SCHEDULED_TIME - ELAPSED_TIME. Print the first
six elements of the new variable.
(10 points) Extract the observations (i.e., rows) with
AIRLINE of Delta and ORIGIN_AIRPORT of
Cincinnati/Northern Kentucky International Airport, and
DEPARTURE_DELAY time larger than 30 minutes, and put these
observations into a new data frame. Print the first six flight numbers
of the new data frame.
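A minimal sketch of this kind of row extraction with dplyr's filter(), on toy data. The column AIRLINE_NAME and the short labels "Delta" and "CVG" are assumptions; the real data may store the full airline and airport names.

```r
library(dplyr)

# Toy merged data (hypothetical values).
merged_data <- data.frame(
  FLIGHT_NUMBER = c(10, 20, 30, 40),
  AIRLINE_NAME = c("Delta", "Delta", "Other", "Delta"),
  ORIGIN_AIRPORT = c("CVG", "CVG", "CVG", "ATL"),
  DEPARTURE_DELAY = c(45, 10, 60, 90)
)

# Conditions separated by commas are combined with AND.
# Rows where a condition evaluates to NA are dropped.
delayed_delta <- merged_data %>%
  filter(AIRLINE_NAME == "Delta",
         ORIGIN_AIRPORT == "CVG",
         DEPARTURE_DELAY > 30)

head(delayed_delta$FLIGHT_NUMBER)
```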
(10 points) Use group_by() and
summarize() to compute the average departure delay time for
different airlines. Which airlines have the longest and shortest average
departure delays?
(10 points) Use group_by() and
summarize() to compute the average departure delay time for
different ORIGIN_AIRPORT. Sort these airports in descending order
according to the average departure delay time and print the top six
rows, i.e., top six airports and their average delay times. Which
ORIGIN_AIRPORT values have the longest and shortest average
departure delays?
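The group_by() and summarize() pattern can be sketched as follows on toy data with made-up values. Grouping by airline instead of airport works the same way. The name avg_dep_delay is arbitrary, and na.rm = TRUE is an assumption about how missing delays should be handled.

```r
library(dplyr)

# Toy flights data (hypothetical values).
flights_data <- data.frame(
  ORIGIN_AIRPORT = c("CVG", "CVG", "ATL", "ATL", "ORD"),
  DEPARTURE_DELAY = c(10, 20, 5, NA, 40)
)

avg_delay <- flights_data %>%
  group_by(ORIGIN_AIRPORT) %>%
  # na.rm = TRUE ignores missing delays instead of returning NA.
  summarize(avg_dep_delay = mean(DEPARTURE_DELAY, na.rm = TRUE)) %>%
  # desc() sorts from the largest to the smallest average delay.
  arrange(desc(avg_dep_delay))

head(avg_delay)
```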
(10 points) For flights departing from CVG airport, count how many flights are offered by each airline. Print the entire list.
In this homework, we will learn from the pioneer in data visualization, Hans Rosling, and try to recreate one of his visualizations. Watch Hans Rosling’s presentation and take a look at his gapminder website, which shows the same visualization with higher-resolution images.
Questions
1 (100 points). Please replicate Hans Rosling’s visualization as closely as possible using ggplot. You only need to select one year to replicate. Try your best to replicate the symbol colors, shapes, sizes, axes, ticks, labels, text, grids, background colors, background text, etc. Of course, it is impossible to replicate everything exactly. The visualization below should be your target.
A slightly different version of the data in this visualization is
available in the R package gapminder. You can install the
package by using install.packages("gapminder") and load the
data using data(gapminder). Below is an acceptable example
of the replication based on the gapminder package.
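As a rough starting point only (far from a finished replication), a bubble chart for one year might be sketched as follows. It assumes the ggplot2 and gapminder packages are installed; the year 2007, the colors, the tick positions, and the label position are arbitrary choices, not part of the target.

```r
library(ggplot2)
library(gapminder)

# Keep a single year of the gapminder data.
d <- subset(gapminder, year == 2007)

p <- ggplot(d, aes(x = gdpPercap, y = lifeExp, size = pop, fill = continent)) +
  # Large, faint year label drawn first so that it sits behind the points.
  annotate("text", x = 5000, y = 55, label = "2007",
           color = "grey80", size = 30, alpha = 0.5) +
  # shape = 21 is a filled circle: fill sets the inside, color the outline.
  geom_point(shape = 21, color = "black", alpha = 0.7) +
  scale_x_log10(breaks = c(500, 1000, 2000, 4000, 8000, 16000, 32000)) +
  scale_size(range = c(1, 20), guide = "none") +
  labs(x = "GDP per capita", y = "Life expectancy") +
  theme(panel.background = element_blank(),
        axis.line = element_line(color = "black"),
        panel.grid.major = element_line(color = "grey"))

p
```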
Useful ggplot Functions
+ annotate( geom="text", x=... , y=... , label=... , color=... , size=... , alpha=... )
+ geom_point( ... , shape = 21) Note that if you use shape=21, you need to use fill = ... to specify the color inside the circle instead of color = ....
+ geom_point(...... , alpha = 0.xxx )
+ scale_fill_manual( breaks = c("asia", ...... ) , values = c("pink", ......) ). Alternatively, you can also use + scale_color_manual( same arguments )
+ scale_y_continuous( breaks = ... , labels = ... ) and + scale_x_continuous( breaks = ... , labels = ... ).
+ scale_y_log10( ... ) and + scale_x_log10( ... ).
+ scale_size( range = ... ).
panel.background = element_blank() inside +theme().
axis.line = element_line(color = "black") inside +theme().
panel.grid.major = element_line(color = "grey") inside +theme().
Grading Instructions
The grade depends on:
For this case study, we will analyze a data set on college admission
in college.csv. In the data set, each row represents one
university and each column represents one variable. The continuous
variables are admission_rate, sat_avg, undergrads, tuition,
faculty_salary_avg, loan_default_rate, median_debt, lon, and lat. The
categorical variables are name, city, state, region, highest_degree,
control, and gender.
Questions
1 (5 points). For college.csv, how many variables and
how many observations are there in the data? Are there missing values in
the data?
2 (10 points). Pick one continuous variable and visualize its distribution. Pick one categorical variable and visualize its distribution. For continuous variables, you can choose from histograms, density plots, violin plots, and many others. For categorical variables, you can choose from bar plots and many others. Describe what you observe in the visualization.
3 (15 points). Pick three pairs of variables (i.e., continuous vs continuous, continuous vs categorical, and categorical vs categorical). For each pair of variables, visualize the association between them. Make sure there are some meaningful patterns in your visualization. Describe what you observe in the visualization.
4 (10 points). Visualize the association between a pair of variables (of your choice) conditional on a third variable (of your choice). Make sure there are some meaningful patterns in your visualization. Describe what you observe in the visualization. This is similar to the previous questions but your visualization involves three variables. You can often use color, shape, size, facet to represent the third variable.
5 (10 points). Visualize the association/interaction among four variables (of your choice). Make sure there are some meaningful patterns in your visualization. Describe what you observe in the visualization. This is similar to the previous questions but your visualization involves four variables. You can often use color, shape, size, facet to represent the third and fourth variables.
6 (50 points). Propose two questions you are interested in about this data set, and then answer these questions using visualization. Some example questions can be:
Please propose your own questions and do not use the exact same questions listed above. Note that this question is similar to the questions for the final project.
For this case study, we will analyze a data set
country_stat.csv on different countries’ GDP, population,
life expectancy, infant mortality rate, fertility rate, continent, and
region, measured over years. Each row represents one country in a
particular year. Continuous variables include year, GDP, population,
life expectancy, infant mortality rate, and fertility rate. Discrete
variables include continent and region.
Questions
(10 points) Are there missing values in the data? If so, can you show how data is missing? (open-ended question)
(5 points) How many unique countries are included in the data? How many years of observations are included in the data?
(5 points) In the data, create a new variable called
GDP_per_capita which equals
GDP/population.
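A minimal sketch of counting unique values and adding a derived column, on toy data. The column names country, year, GDP, and population are assumptions about country_stat.csv, and the values are made up.

```r
# Toy stand-in for country_stat.csv (hypothetical columns and values).
d <- data.frame(
  country = c("A", "A", "B", "B"),
  year = c(2000, 2001, 2000, 2001),
  GDP = c(100, 120, 900, 950),
  population = c(10, 10, 30, 38)
)

# Number of distinct countries and distinct years.
length(unique(d$country))
length(unique(d$year))

# Element-wise division creates the new column in place.
d$GDP_per_capita <- d$GDP / d$population
head(d)
```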
(80 points) Propose four questions you would like to know about this data (20 points per question). At least one question needs to be related to time series and be answered using time series data visualization. Some example questions can be: Do developing countries grow more slowly than developed countries? Is Africa catching up with the world or being left behind? Is the world more divided now than it was 50 years ago? Please propose your own questions and do not use the exact same questions listed above.
For this case study, we will analyze three data sets on CVG flights,
CVG_Flights.csv, airlines.csv, and
airports.csv, and prepare a report summarizing the
results. In CVG_Flights.csv, each row represents one flight
either to or from CVG from January to March 2015. Columns/variables
include flight’s information such as date, flight number, delay time,
origin and destination airports, departure time, air time, distance, and
cancellation. In airlines.csv, each row represents one
airline company. Columns/variables include airline names. In
airports.csv, each row represents one airport.
Columns/variables include airport names, city, state, longitude and
latitude.
Questions
For CVG_Flights.csv, how many variables and how many
observations are there in the data? Are there missing values in the data? If so,
can you show how data is missing?
For each variable in the data set, please describe what you observe, such as summary statistics, distributions, etc.
Visualize the association between two variables of your choice. Check to see if there is an interesting relationship worth mentioning. If so, you can explore further and visualize what you have found.
Visualize the association between some variable pairs (of your choice) conditional on some other variables (of your choice). This is similar to the previous questions but your visualization involves more than two variables.
Merge all three data sets CVG_Flights.csv,
airlines.csv, and airports.csv according to
the IATA code for airlines and airports. This is the same as one of the
questions in case study 1.
Based on the merged data set (i.e., merge
CVG_Flights.csv, airlines.csv, and
airports.csv by the airline and airport IATA codes),
propose four questions you would like to know about
this data. Then answer these questions using visualization. One
of these questions needs to be related to time series data
visualization. Another one of these questions needs to
be related to spatial data visualization. Some example
questions can be: Does CVG offer more flights to the east coast than to the west
coast? Which regions in the USA see more flight delays? How are the
airports distributed around the USA? Does the average delay time or
cancellation rate change from week to week? Please propose your own
questions and do not use the exact same questions listed
above.
In this homework, we will first learn from the pioneer and visionary in data visualization, Florence Nightingale, and try to replicate and improve upon one of her visualizations, Nightingale’s rose chart. Nightingale revolutionized nursing and was also a mathematician who knew the power of a visible representation of information. Below is Nightingale’s rose chart. Please see the link here for a detailed description of the background information.
Questions
1 (10 points). What are the strengths of Nightingale’s rose chart? What do you like about this visualization? Note that it was created before modern technology became available. In addition, what are some of the weaknesses of this visualization?
2 (30 points). Please replicate this visualization as closely as
possible using R. Obviously, it is difficult to replicate every
single detail in the visualization. You should try to replicate as much
as you can. The data for this visualization can be downloaded at https://github.com/yichenqin/dataviz/blob/main/data/Nightingale.RData
Click “Download” to download the RData file, and use the
load("Nightingale.RData") function to load the data into R.
You can use the variables of rates, i.e.,
Disease.rate, Wounds.rate, and
Other.rate. Here are some useful ggplot functions
+geom_bar(width = 1,stat="identity",position = "identity", alpha=0.xxx)
+geom_text(data = ..., aes(x=..., y=..., label=...), size=...).
+coord_polar()
+scale_fill_manual(values = c("pink",...))
+ scale_x_log10(), + scale_x_sqrt(), + scale_y_log10(), and + scale_y_sqrt().
3 (60 points). Try to improve upon this visualization based on the weaknesses you identified by creating a new visualization. It can be of a different visualization type.
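For the replication in question 2, the functions listed there might be combined as in the following minimal sketch. It uses a toy long-format data frame with made-up values, since the layout of Nightingale.RData may differ; the names Month, Cause, and Rate and the colors are assumptions. The square-root y scale is one plausible choice: it makes wedge area, rather than radius, roughly proportional to the rate.

```r
library(ggplot2)

# Toy long-format data: one row per month and cause (hypothetical values).
toy <- data.frame(
  Month = factor(rep(c("Apr", "May", "Jun"), times = 3),
                 levels = c("Apr", "May", "Jun")),
  Cause = rep(c("Disease", "Wounds", "Other"), each = 3),
  Rate  = c(1.4, 6.2, 4.7, 0, 0, 0, 7.0, 4.6, 2.5)
)

p <- ggplot(toy, aes(x = Month, y = Rate, fill = Cause)) +
  # position = "identity" overlays the wedges instead of stacking them;
  # alpha keeps the overlapped wedges visible.
  geom_bar(width = 1, stat = "identity", position = "identity", alpha = 0.6) +
  scale_y_sqrt() +
  # Polar coordinates turn the bars into wedges.
  coord_polar() +
  scale_fill_manual(values = c(Disease = "skyblue", Wounds = "pink", Other = "grey30"))

p
```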
Submission and Grading Instructions
The grade of question 2 depends on
The grade of question 3 depends on